No. Collectors did not choose randomly but preferred rarer types over more common ones.
It is a sample of convenience.
Thus, there is bias in every year’s sample because rare types are overrepresented. Further, this bias might change across years as the frequencies of the two morphs have changed over time.
Discrete.
Technically it is a discrete variable (because fractions can be enumerated / are restricted to finitely many values in a finite sample), but it makes sense to treat it as a continuous variable if the sample is very large.
Discrete. There are no half crimes.
Continuous. Body mass is continuous, and hence so is its logarithm.
E(xplanatory): Altitude (categorical: high vs low) R(esponse): Growth rate S(tudy type): Observational
E: Treatment (standard vs. tasploglutide) R: Rate of insulin release S: Experimental
E: Health status (schizophrenia vs. healthy) R: Frequency of illegal drug use S: Observational
E: Number of legs R: Survival probability S: Experimental
E: Treatment (advanced communication therapy vs. social visits without formal therapy) R: Communication ability S: Experimental
The main problem is a strong bias in the sample. To see this, consider the sample of planes that were used in this study. Only planes that were not hit in critical areas were available to estimate the distribution of bullet holes. Planes that were hit in a critical area, i.e., one that leads to a crash, were not available because they did not return to base. With this knowledge, it becomes clear that it would have been better to reinforce the areas where no, or very few, bullet holes were found, namely the cockpit and engine.
The population is all the small mammals of Kruger National Park; the parameter being estimated is the number of small mammal species in the Park.
No, the sample is not likely to be random. In a random sample, every individual has the same chance of being selected. But some small mammals might be easier to trap than others (for example, trapping only at night might miss all the mammals active only in the daytime). In a random sample, individuals are also selected independently. Multiple animals caught in the same trap might not be independent if they are related or live near one another (this is harder to judge).
The number of species in the sample might underestimate the number in the Park if sampling was not random (e.g., if daytime mammals were missed), or if rare species happened to avoid capture. Even if the sample is perfectly random, the estimator will underestimate the true number in most cases. You can sample fewer species just by chance, but never more species than there actually are. Thus, on average, you will underestimate the true number.
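This downward bias can be illustrated with a small simulation (a sketch with made-up species abundances, not part of the original exercise):

```r
# Hypothetical community: 10 species with unequal abundances,
# so rare species are easily missed in a sample.
set.seed(42)
true.species = 10
abundances = c(100, 80, 60, 40, 20, 10, 5, 3, 2, 1)
individuals = rep(1:true.species, times = abundances)
# repeatedly trap 50 individuals and count the species seen
reps = 1000
observed = replicate(reps, length(unique(sample(individuals, size = 50))))
mean(observed)   # on average fewer than 10 species
max(observed)    # never more than the true number
```

By chance a sample can contain fewer species than exist, but never more, so the estimated species count is biased downward.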
You should see a blank plot. This is what ggplot() does: it simply creates a “canvas” to which you can add “layers”.
How many rows are in mpg? How many columns?
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
There are 234 observations of 11 variables. Thus, there are 11 columns and 234 rows. We can confirm this by looking at the dimensions of the underlying data matrix:
dim(mpg)
## [1] 234 11
?mpg
drv is a categorical variable indicating the drive type: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive.
ggplot(data = mpg) +
geom_point(mapping = aes(y = hwy,x = cyl)) +
theme_classic() +
labs(title = "A scatterplot", y = "miles per gallon on highways",x="cylinders")
ggplot(data = mpg) +
geom_point(mapping = aes(x = class,y = drv)) +
theme_classic() +
labs(title = "A useless plot", x = "class",y="drive type")
Both drv and class are categorical variables, and it does not make sense to visualize them this way. What would be a better way to show these data?
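One possible answer (a sketch; other solutions exist): count the observations per combination, for example with geom_count, which scales the point size by the number of cases at each position:

```r
library(ggplot2)

# point size reflects how many cars fall into each class/drive-type combination
ggplot(data = mpg) +
  geom_count(mapping = aes(x = class, y = drv)) +
  theme_classic() +
  labs(x = "class", y = "drive type")
```

A plain frequency table, table(mpg$class, mpg$drv), conveys the same information without a plot.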
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
A continuous variable will lead to a continuous color gradient rather than a discrete set of colors.
ggplot(data = mpg) +
geom_point(mapping = aes(x = cty, y = hwy, color = displ))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
genes <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter04/chap04e1HumanGeneLengths.csv"))
# a) How many individuals are in the sample (i.e., what is the sample size, n)?
n = dim(genes)[1]
print(n)
## [1] 20290
# b) What is the sum of all of the observations?
sum.gene.lengths = sum(genes$geneLength)
# c) What is the mean of this sample?
mean.gene.length1 = sum.gene.lengths/n
mean.gene.length2 = mean(genes$geneLength)
print(mean.gene.length1)
## [1] 2622.027
print(mean.gene.length2)
## [1] 2622.027
# d) What is the sum of the squares of the measurements?
square.sum.gene.lengths = sum(genes$geneLength^2)
print(square.sum.gene.lengths)
## [1] 223678143206
# e) What is the variance of this sample?
# by hand:
variance.gene.length = sum((genes$geneLength - mean.gene.length1)^2)/(n-1)
print(variance.gene.length)
## [1] 4149235
# for comparison:
var(genes$geneLength)
## [1] 4149235
# f) What is the standard deviation of this sample?
sd(genes$geneLength)
## [1] 2036.967
# or
sd.gene.length = sqrt(variance.gene.length)
# g) What is the coefficient of variation for this sample?
cv.gene.length = sd.gene.length/mean.gene.length1
print(cv.gene.length)
## [1] 0.7768672
# h) Display the data using different plotting techniques. Which one illustrates the data best?
# Which one do you like best?
ggplot(data = genes) +
geom_histogram(mapping = aes(x = geneLength),color="black",fill="darkred") +
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = genes) +
geom_density(mapping = aes(x = geneLength),color="black",fill="darkred") +
xlab("gene length") +
ylab("density") +
scale_x_log10() +
theme_classic()
ggplot(data = genes) +
geom_boxplot(mapping = aes(y = geneLength),color="black",fill="darkred") +
theme_classic() +
coord_flip()
Use the stickleback dataset and calculate the median and mean for each genotype. Make a histogram and add vertical lines to the histogram indicating the mean and median for each genotype. What do you see? How does the shape of the distribution affect differences between mean and median?
# load the data file:
stickleback <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter03/chap03e3SticklebackPlates.csv"))
# one possible solution:
mean.mm = mean(stickleback$plates[stickleback$genotype=="mm"])
mean.mM = mean(stickleback$plates[stickleback$genotype=="Mm"])
mean.MM = mean(stickleback$plates[stickleback$genotype=="MM"])
median.mm = median(stickleback$plates[stickleback$genotype=="mm"])
median.mM = median(stickleback$plates[stickleback$genotype=="Mm"])
median.MM = median(stickleback$plates[stickleback$genotype=="MM"])
par(mfrow = c(1,3))
hist(stickleback$plates[stickleback$genotype=="MM"],
main = "MM",xlab = "number of plates",
col="darkred")
abline(v = mean.MM,col="black")
abline(v = median.MM,col="blue")
hist(stickleback$plates[stickleback$genotype=="Mm"],
main = "Mm",xlab = "number of plates",
col="darkred")
abline(v = mean.mM,col="black")
abline(v = median.mM,col="blue")
hist(stickleback$plates[stickleback$genotype=="mm"],
main = "mm",xlab = "number of plates",
col="darkred")
abline(v = mean.mm,col="black")
abline(v = median.mm,col="blue")
# alternatively, we can first create a data frame
# with the means per genotype
# check out the help function of ddply
# (this is only one way to do this, you can also do it by
# "hand" or using other functions such as tapply) :
library(plyr)
df.mean = ddply(stickleback, "genotype", summarize, m.number = mean(plates))
df.median = ddply(stickleback, "genotype", summarize, m.number = median(plates))
# then we plot it with a single ggplot
ggplot() +
geom_histogram(data = stickleback, mapping = aes(x=plates),binwidth=5,color="black",fill="darkred") +
geom_vline(data = df.mean, aes(xintercept=m.number),col="black",size=1) +
geom_vline(data = df.median, aes(xintercept=m.number),col="blue",size = 1) +
facet_wrap(~ genotype) +
ylab("frequency") +
theme_classic()
Write a short R script that simulates the sampling distribution of the mean. Plot the sampling distribution and add the theoretical expectation to the plot. Follow these steps:
# We choose the Poisson distribution with mean lambda = 50
lambda = 50
# We do 1000 replicates
reps = 1000
# We choose the sample size n to be 10
sample_size = 10
# the means of each sample will be stored in this vector
means = vector("numeric",reps)
# in this loop, we calculate the mean of each sample
for (i in 1:reps)
{
# take a sample of the poisson distribution
sample = rpois(sample_size,lambda)
# store the mean in our vector
means[i] = mean(sample)
}
# a histogram - note that we set freq = FALSE to
# show relative frequencies
hist(means,freq=F,main="The sampling distribution of the mean")
# we specify the x values for the theoretical curve
x = seq(0,3*lambda,by=0.1)
# we calculate the standard error:
SE = sqrt(lambda)/sqrt(sample_size)
# calculate the corresponding normal PDF (by the central limit theorem,
# the sampling distribution of the mean is approximately normal)
y = dnorm(x,lambda,SE)
lines(x,y)
The pizza below, ordered from the Venn Pizzeria on Bayes Street, is divided into eight slices.
Figure: Venn Pizza
Answer the following questions based on the drawing of the pizza:
\[P(pepperoni) = 5/8 = 0.625 = 62.5 \%\]
\[P(pepperoni \quad and \quad anchovies) = 2/8 = 0.25 = 25 \%\]
\[P(pepperoni \quad or \quad anchovies) = 7/8 = 0.875 = 87.5\%\]
No. There are two slices with both pepperoni and anchovies.
Yes. There is no slice that has both olives and mushrooms on it.
No, because
\[P(anchovy) = 4/8 = 0.5\]
\[P(mushroom) = 3/8 = 0.375\]
and
\[P(mushroom \quad and \quad anchovy) = 1/8\]
but
\[P(mushroom)*P(anchovy) = 3/8 * 4/8 = 3/16 = 0.1875 = 18.75 \% .\]
\[P(anchovy \mid olive) =\] \[P(anchovy \quad and \quad olive)/P(olive) = (1/8) / (2/8) = 1/2 = 0.5 = 50\%\]
\[P(olive \mid anchovy) =\] \[P(olive \quad and \quad anchovy)/P(anchovy) = (1/8) / (4/8) = 1/4 = 0.25 = 25\%\]
\[P(last \quad slice \quad has \quad olives) = 2/8 = 0.25 = 25\%\].
This can be seen directly because each slice has the same probability of being the last slice.
\[P(first \, slice \,has \, olives) = 2/8\]
\[P(second \, slice \, also \, has \, olives) = 1/7\]
\[P(both \, slices \, have \, olives) = 2/8 * 1/7 = 1/28 \approx 3.6\% \]
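This result can be checked with a quick simulation (optional; it simply draws two of the eight slices without replacement):

```r
# 2 of the 8 slices have olives; draw the first two slices at random
set.seed(1)
slices = c(rep("olive", 2), rep("other", 6))
reps = 100000
both.olive = replicate(reps, all(sample(slices, 2) == "olive"))
mean(both.olive)   # close to 1/28 = 0.0357
```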
\[P(no \, pepperoni) = 1 - P(pepperoni) = 3/8 = 0.375 = 37.5\%\]
A pizza on which mushrooms, olives, anchovies, and pepperoni are all mutually exclusive, i.e., no slice carries more than one of these toppings.
\[P(no \, dangerous \, snakes) = \] \[ P(no \, dangerous \, snake \, in \, left \, hand) P(no \, dangerous \, snake \, in \, right \, hand)\] \[= 3/8 * 2/7 = 6/56 = 0.107 = 10.7 \%\]
\[P(bite) = P(bite \mid 0 \, dangerous \, snakes) P(0 \, dangerous \, snakes)\] \[+ P(bite \mid 1 \, dangerous \, snake) P(1 \, dangerous \, snake) \] \[+P(bite \mid 2 \, dangerous \, snakes) P(2 \, dangerous \, snakes)\]
\[P(0 \, dangerous \, snakes) = 0.107\] (from (a)) \[P(1 \, dangerous \, snake) = (5/8 * 3/7) + (3/8 * 5/7) = 0.536\] \[P(2 \, dangerous \, snakes) = 5/8 * 4/7 = 0.357\] \[P(bite \mid 0 \, dangerous \, snakes) = 0\] \[P(bite \mid 1 \, dangerous \, snake) = 0.8\] \[P(bite \mid 2 \, dangerous \, snakes) = 1- P(no \, bite \mid 2 \, dangerous \, snakes) = 1-(1- 0.8)^2 = 0.96\]
Putting all this together: \[P(bite) = (0*0.107)+(0.8*0.536)+(0.96*0.357) = 0.772\]
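As an optional sanity check, we can simulate the snake grab directly (assuming, as above, that each dangerous snake bites independently with probability 0.8):

```r
# 5 dangerous and 3 defanged snakes; grab two without replacement
set.seed(7)
snakes = c(rep("dangerous", 5), rep("defanged", 3))
reps = 100000
bitten = replicate(reps, {
  grabbed = sample(snakes, 2)
  n.dang = sum(grabbed == "dangerous")
  any(runif(n.dang) < 0.8)   # each dangerous snake bites with prob 0.8
})
mean(bitten)   # close to 0.77
```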
\[P(defanged \mid no \, bite)= \left[P(no \, bite \mid defanged)P(defanged)\right]/P(no \, bite)\] \[P(no \, bite \mid defanged)=1\] \[P(defanged) = 3/8\]
$$P(no \, bite) = P(defanged)P(no \, bite \mid defanged) $$
$$+ P(dangerous)P(no \, bite \mid dangerous) = (3/8 * 1) + (5/8 * (1-0.8)) = 0.5$$
Therefore
$$P(defanged \mid no \, bite) = (1*3/8 )/ (0.5) = 0.75 = 75\%.$$
### Exercise 12
a) What is the probability that all five researchers have calculated an interval that includes the true value of the parameter?
\(0.95^5 = 0.774\)
b) What is the probability that at least one does not include the true parameter value?
\(1- 0.95^5 = 0.226\)
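Both results can also be read off a binomial distribution (an optional cross-check): the number of intervals that miss the true value is binomial with \(n = 5\) and \(p = 0.05\).

```r
# P(all five intervals cover the true value) = P(0 misses) = 0.95^5
dbinom(0, size = 5, prob = 0.05)
# P(at least one interval misses)
1 - dbinom(0, size = 5, prob = 0.05)
```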
## Week 6 (01.04.2020)
### Exercise 13
#### Solution
We first choose values for \(\mu\) and \(\sigma\) (you can choose any values you want):
mu = 10
sigma = 2
Next we choose the values for the x axis. A good choice is to cover the range from \(\mu - 4 \sigma\) to \(\mu + 4 \sigma\). We choose a step size of 0.1:
x = seq(mu-4*sigma,mu+4*sigma,by=0.1)
Now we compute the corresponding \(y\) values for our plot using the function dnorm:
y = dnorm(x, mean = mu, sd = sigma)
We now do the same thing for the standard normal distribution:
mu.std = 0
sigma.std = 1
x.std = seq(mu.std-4*sigma.std,mu.std+4*sigma.std,by=0.1)
y.std = dnorm(x.std, mean = 0, sd = 1)
And finally we plot both density functions next to each other:
par(mfrow = c(1,2))
plot(x,y,type="l",xlab="x",ylab="density")
plot(x.std,y.std,type="l",xlab="x",ylab="density")
We see that both curves look exactly the same. The only difference is the scale of the x and y axes.
random.sample = rnorm(1000,mu,sigma)
hist(random.sample,freq=F,ylim=c(0,0.2),breaks=30)
lines(x,y)
Q10 = qnorm(0.1,mean=mu,sd=sigma)
sum(random.sample<=Q10)/1000
## [1] 0.1
We find that approximately 10 percent of our random sample is smaller than \(Q_{10}\). This is exactly the definition of a quantile!
### Exercise 14 a) Let \(X\) be a random variable that follows the standard normal distribution. Calculate the \(p = 0.05,0.1,0.15, ..., 0.95\) percentiles of the standard Normal distribution. Store the values in a vector \(Q\). Next calculate the cumulative distribution function for the values in \(Q\). Store this in a variable \(C\). Finally, plot \(p\) against \(Q\). What do you see? Can you explain? What if you use different values for the mean and the standard deviation?
p = seq(0.05,0.95,by=0.05)
Q = qnorm(p,mean=0,sd=1)
C = pnorm(Q,mean=0,sd=1)
plot(p,C)
We find that \(p = C\). This means that the cumulative distribution function is the inverse of the quantile function:
because
\[Q(p) = q_p\] it follows
that
\[F(Q(p)) = F(q_p) = p\] Here is a graphical representation of this relationship:
cdf.std = pnorm(x.std,mean=0,sd=1)
plot(x.std,cdf.std,type="l",xlab="x",ylab = "P(X<x) ")
abline(v=qnorm(0.1,mean=0,sd=1),lwd=2,col="dodgerblue")
abline(h = 0.1,col="slategray",lwd=2)
text(-1.8,0.3,expression(Q[10]),col="dodgerblue")
text(-3.3,0.2,expression(P(X < Q[10])==0.1),col="slategray")
Plot the probability density function of a Normal distribution.
mu.std = 0
sigma.std = 1
x.std = seq(mu.std-4*sigma.std,mu.std+4*sigma.std,by=0.1)
y.std = dnorm(x.std, mean = 0, sd = 1)
plot(x.std,y.std,type="l")
c1 = qnorm(0.05,0,1)
abline(v = c1)
mu.std = 0
sigma.std = 1
x.std = seq(mu.std-4*sigma.std,mu.std+4*sigma.std,by=0.1)
y.std = dnorm(x.std, mean = 0, sd = 1)
plot(x.std,y.std,type="l")
c2 = qnorm(c(0.025,0.975),0,1)
abline(v=c2)
set.seed(1984)
n = 1000
random.sample = rnorm(n,0,1)
sum(random.sample < c2[1] | random.sample > c2[2])/n
## [1] 0.052
plot(x.std,y.std,type="l")
rug(random.sample)
rug(random.sample[random.sample < c2[1] | random.sample > c2[2]],col="red")
We see that roughly 5 percent fall into the extreme tails of the distribution. This is sensible, because we have defined the extreme part here as the range \((-\infty,c_1] \cup [c_2,\infty)\), where \(c_1\) and \(c_2\) are the 2.5 and 97.5 percentiles of the distribution.
Use the stickleback data set. Calculate a confidence interval for the mean plate numbers for each genotype. Make a plot that shows the mean and the confidence interval of each genotype.
Now perform a t.test for each comparison of plate numbers between genotypes. Compare the outcome of the t.tests with the results from the confidence intervals. What do you find?
### Solution Part a)
We first calculate the mean for each genotype:
stickleback <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter03/chap03e3SticklebackPlates.csv"))
mean.mm = mean(stickleback$plates[stickleback$genotype=="mm"])
mean.Mm = mean(stickleback$plates[stickleback$genotype=="Mm"])
mean.MM = mean(stickleback$plates[stickleback$genotype=="MM"])
Next, the upper and lower limits of the 95% CI using a t-test:
t.t.mm = t.test(stickleback$plates[stickleback$genotype=="mm"])
t.t.Mm = t.test(stickleback$plates[stickleback$genotype=="Mm"])
t.t.MM = t.test(stickleback$plates[stickleback$genotype=="MM"])
CI.mm = t.t.mm$conf.int
CI.Mm = t.t.Mm$conf.int
CI.MM = t.t.MM$conf.int
We put it all in a dataframe:
df = data.frame(means = c(mean.mm,mean.Mm,mean.MM),
lower = c(CI.mm[1],CI.Mm[1],CI.MM[1]),
upper = c(CI.mm[2],CI.Mm[2],CI.MM[2]),
genotypes = c("mm","Mm","MM"))
Now we can plot everything:
ggplot(data = df) +
geom_errorbar(mapping = aes(x = genotypes,ymin=lower, ymax=upper), width=.1) +
geom_point(mapping = aes(x = genotypes,y=means))
If you prefer (I don’t), you can also do a bar plot:
ggplot(data = df) +
geom_col(mapping = aes(x = genotypes,y=means,fill=genotypes)) +
geom_errorbar(mapping = aes(x = genotypes,ymin=lower, ymax=upper), width=.1) +
geom_point(mapping = aes(x = genotypes,y=means))
Either way, it is advisable to also add the raw data:
ggplot(data = df) +
geom_jitter(data = stickleback,mapping = aes(x = genotype,y = plates,color=genotype),size=0.8,width=0.05)+
geom_errorbar(mapping = aes(x = genotypes,ymin=lower, ymax=upper), width=.2) +
geom_point(mapping = aes(x = genotypes,y=means))
### Solution Part b)
In part (a) we used a one-sample t-test to calculate the confidence intervals. This test also tested the null hypothesis that the mean of each group is 0. If we want to compare means between groups, we need to do a two-sample t-test:
t.test(stickleback$plates[stickleback$genotype=="mm"],
stickleback$plates[stickleback$genotype=="Mm"])
##
## Welch Two Sample t-test
##
## data: stickleback$plates[stickleback$genotype == "mm"] and stickleback$plates[stickleback$genotype == "Mm"]
## t = -32.001, df = 208.06, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -41.09355 -36.32416
## sample estimates:
## mean of x mean of y
## 11.67045 50.37931
t.test(stickleback$plates[stickleback$genotype=="mm"],
stickleback$plates[stickleback$genotype=="MM"])
##
## Welch Two Sample t-test
##
## data: stickleback$plates[stickleback$genotype == "mm"] and stickleback$plates[stickleback$genotype == "MM"]
## t = -95.49, df = 167.89, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -52.16670 -50.05336
## sample estimates:
## mean of x mean of y
## 11.67045 62.78049
t.test(stickleback$plates[stickleback$genotype=="Mm"],
stickleback$plates[stickleback$genotype=="MM"])
##
## Welch Two Sample t-test
##
## data: stickleback$plates[stickleback$genotype == "Mm"] and stickleback$plates[stickleback$genotype == "MM"]
## t = -10.262, df = 207.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -14.78364 -10.01871
## sample estimates:
## mean of x mean of y
## 50.37931 62.78049
We find that all t-tests yield a significant result (small p-values). This means that the null hypothesis that the means of two groups are the same can be rejected for all comparisons.
A shorter way is to use the function pairwise.t.test. The first argument is the variable on which the t-tests should be performed. The second argument is the grouping factor: the categorical variable that separates the first argument into different groups.
pairwise.t.test(stickleback$plates,stickleback$genotype)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: stickleback$plates and stickleback$genotype
##
## mm Mm
## Mm < 2e-16 -
## MM < 2e-16 1.5e-15
##
## P value adjustment method: holm
How does this finding relate to the confidence intervals? The confidence intervals are not overlapping, strongly suggesting that the means between groups are indeed different (at the 95 percent level). Thus, the t-test and the confidence interval indicate the same result.
Use the data set “chap11e2Stalkies.csv” of eye span measurements from a sample of stalk-eyed flies. Perform a one-sample t-test by “hand”, that is, calculate the t statistic and the p-value in R without using the function t.test (or a similar function). Test the null hypothesis that the mean eye span is equal to \(9\). Use a significance threshold of 0.05 and a two-sided test. Then check your results with the function t.test. Make a plot to illustrate your analysis.
stalkie <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter11/chap11e2Stalkies.csv"))
m0 = 9
M <- mean(stalkie$eyespan)
n <- length(stalkie$eyespan)
sigma <- sd(stalkie$eyespan)
SE <- sigma / sqrt(n)
test.statistic <- (M - m0) / SE
p.value = 2 * pt(-abs(test.statistic),df=8)
print(p.value)
## [1] 0.1327105
x=seq(-4,4,by=0.01)
y = dt(x,df=8)
plot(x,y,type="l")
abline(v = test.statistic,col="dodgerblue")
text(-3,0.2,"test statistic",col="dodgerblue")
text(2,0.3,"t-distribution,
with 8 degrees of freedom")
x1 = seq(-4,test.statistic,by=0.01)
y1 = dt(x1,df=8)
x2 = rev(x1)
y2 = y1*0
polygon(c(x1,x2),c(y1,y2),col = "red")
polygon(-c(x1,x2),c(y1,y2),col = "red")
text(-3.5,0.05,"P value",col="red")
The p value of the t-test is
test = t.test(stalkie$eyespan-9)
print(test$p.value)
## [1] 0.1327105
and the test statistic is
print(test$statistic)
## t
## -1.673772
We can summarize this in a table:
dat = data.frame(pValues = c(p.value,test$p.value),tStatistics = c(test.statistic,test$statistic))
colnames(dat) = c("P Value","Test Statistic")
rownames(dat) = c("by hand","using t.test()")
kableExtra::kable(dat)
|  | P Value | Test Statistic |
|---|---|---|
| by hand | 0.1327105 | -1.673772 |
| using t.test() | 0.1327105 | -1.673772 |
Perform a goodness-of-fit test for a data set with two categories. The data file is “chap08e4XGeneContent.csv” and can be found on ilias. Start by reading and inspecting the data. Each row is a different gene, with its occurrence on the X chromosome indicated. Then test if the proportion of genes on the X chromosome (among all genes in the genome) is the same as the proportional length of the X chromosome measured in base pairs (among the whole genome). The relative size of the X chromosome is given by \(propX = 1055/20290\) and that of the other chromosomes is \(propRest = 19235/20290\).
Plot a frequency table showing the number of genes on the X and on other chromosomes.
Perform a \(\chi^2\) goodness-of-fit test to the proportional model.
Draw a conclusion from your test.
### Solution
geneContent <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08e4XGeneContent.csv"))
propX = 1055/20290
propAut = 19235/20290
head(geneContent)
## chromosome
## 1 X
## 2 X
## 3 X
## 4 X
## 5 X
## 6 X
geneContent$chromosome <- factor(geneContent$chromosome, levels = c("X","NotX"))
geneContentTable <- table(geneContent$chromosome)
barplot(geneContentTable)
ggplot(geneContent) +
geom_bar(mapping = aes(x = chromosome))
chisq.test( geneContentTable, p = c(propX,propAut))
##
## Chi-squared test for given probabilities
##
## data: geneContentTable
## X-squared = 75.065, df = 1, p-value < 2.2e-16
Interpretation:
Our test yields a very small \(p\)-value, indicating that the null hypothesis
\(H_0:\) The proportion of genes on the X chromosome is the same as the proportion of base pairs on the X chromosome.
can be rejected in favor of the alternative hypothesis
\(H_A:\) The proportion of genes on the X chromosome is not the same as the proportion of base pairs on the X chromosome.
Load the dataset chap08e1DayOfBirth.csv. Each row represents a single birth and shows the day of the week of the birth. Test the null hypothesis that birthdays are randomly distributed in the sense that each day occurs with a probability of \(1/7\). Visualize your results and give an interpretation of the test results. Can you come up with a potential explanation?
birthDay <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08e1DayOfBirth.csv"))
head(birthDay)
## day
## 1 Sunday
## 2 Sunday
## 3 Sunday
## 4 Sunday
## 5 Sunday
## 6 Sunday
Put the days of the week in the correct order for tables and graphs.
birthDay$day <- factor(birthDay$day, levels = c("Sunday", "Monday",
"Tuesday", "Wednesday","Thursday","Friday","Saturday"))
birthDayTable <- table(birthDay$day)
data.frame(Frequency = addmargins(birthDayTable))
## Frequency.Var1 Frequency.Freq
## 1 Sunday 33
## 2 Monday 41
## 3 Tuesday 63
## 4 Wednesday 63
## 5 Thursday 47
## 6 Friday 56
## 7 Saturday 47
## 8 Sum 350
Bar graph of the birth data. The argument cex.names = 0.8 shrinks the names of the weekdays to 80% of the default size so that they fit in the graphics window – otherwise one or more names may be dropped.
barplot(birthDayTable, cex.names = 0.8)
shortNames = substr(names(birthDayTable), 1, 3)
barplot(table(birthDay$day), names = shortNames,
ylab = "Frequency", las = 1, col = "firebrick")
\(\chi^2\) goodness-of-fit test. The vector p is the expected proportions rather than the expected frequencies, and they must sum to 1 (R nevertheless uses the expected frequencies when calculating the test statistic).
chisq.test(birthDayTable, p = rep(1/7,times=7))
##
## Chi-squared test for given probabilities
##
## data: birthDayTable
## X-squared = 15.24, df = 6, p-value = 0.01847
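Optionally, we can inspect the expected counts and the Pearson residuals to see which days drive the result (the counts below are copied from the frequency table above):

```r
# observed counts, taken from the frequency table above
birthCounts = c(Sunday = 33, Monday = 41, Tuesday = 63, Wednesday = 63,
                Thursday = 47, Friday = 56, Saturday = 47)
res = chisq.test(birthCounts, p = rep(1/7, times = 7))
res$expected             # 350/7 = 50 births expected per day
round(res$residuals, 2)  # Pearson residuals: Sunday deviates most strongly
```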
We find a small \(p\)-value (< 0.05), so we can reject the null hypothesis
\(H_0:\) birthdays are uniformly distributed across weekdays.
in favor of the alternative hypothesis
\(H_A:\) birthdays are not uniformly distributed across weekdays.
From inspecting the data, it appears as if births are less common on Sundays. It could be that scheduled caesarean deliveries are not scheduled on weekends. Another reason could be that hospitals may sometimes be understaffed on weekends, and hence there might be a tendency to wait a bit longer before inducing labor in some non-critical cases.
Note, however, that the p-value is not extremely small. We would reject the null hypothesis if we chose \(\alpha = 0.05\) but not if we chose \(\alpha = 0.01\). This illustrates that the decision to reject or not reject a null hypothesis is strongly influenced by the degree of certainty you want to have in your decision.
Assume that Z is a number randomly chosen from a standard normal distribution. Calculate each of the following probabilities:
P(Z > 1.34)
P(Z < 1.34)
P(Z < 2.15)
P(Z < 1.2)
P(0.52 < Z < 2.34)
P(-2.34 < Z < -0.52)
P(Z < -0.93)
P(-1.57 < Z < - 0.32)
Make a sketch for each case and calculate the values using R functions.
Remember that pnorm(x) = P(Z < x) and that 1 - pnorm(x) = P(Z > x).
x = seq(-4,4,by=0.01)
y = dnorm(x,0,1)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 1.34,col = "red")
text(2.3,0.05,"P(Z > 1.34)")
prob = 1 - pnorm(1.34,0,1)
prob
## [1] 0.09012267
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 1.34,col = "red")
text(0,0.05,"P(Z < 1.34)")
prob = pnorm(1.34,0,1)
prob
## [1] 0.9098773
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 2.15,col = "red")
text(0,0.05,"P(Z < 2.15)")
prob = pnorm(2.15,0,1)
prob
## [1] 0.9842224
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 1.2,col = "red")
text(0,0.05,"P(Z < 1.2)")
prob = pnorm(1.2,0,1)
prob
## [1] 0.8849303
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 2.34,col = "red")
abline(v = 0.52,col = "red")
text(1.5,0.3,"P(0.52 <
Z <
2.34)")
prob = pnorm(2.34,0,1) - pnorm(0.52,0,1)
prob
## [1] 0.2918899
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = -2.34,col = "red")
abline(v = -0.52,col = "red")
text(-1.5,0.3,"P(-2.34 <
Z <
-0.52)")
prob = pnorm(-0.52,0,1) - pnorm(-2.34,0,1)
prob
## [1] 0.2918899
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = -0.93,col = "red")
text(-2,0.3,"P(Z < -0.93)")
prob = pnorm(-0.93,0,1)
prob
## [1] 0.1761855
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = -1.57,col = "red")
abline(v = - 0.32,col = "red")
text(-1.5,0.3,"P(-1.57 <
Z <
- 0.32)")
prob = pnorm(-0.32,0,1) - pnorm(-1.57,0,1)
prob
## [1] 0.3162766
As the world warms, the geographic ranges of species might shift toward cooler areas. Chen et al. (2011) studied recent changes in the highest elevation of 31 taxa, in meters, over the late 1900s and early 2000s. Positive numbers indicate upward shifts in elevation, and negative numbers indicate shifts to lower elevations. The values are given by:
data = c(58.9, 7.8, 108.6, 44.8, 11.1, 19.2, 61.9, 30.5, 12.7, 35.8, 7.4, 39.3, 24.0, 62.1, 24.3, 55.3, 32.7, 65.3, -19.3, 7.6, -5.2, -2.1, 31.0, 69.0, 88.6, 39.5, 20.7, 89.0, 69.0, 64.9, 64.8)
n = length(data)
print(paste("The sample size is ",n))
## [1] "The sample size is 31"
m = mean(data)
print(paste("The mean shift is ",m, "meters."))
## [1] "The mean shift is 39.3290322580645 meters."
sd = sd(data)
print(paste("The standard deviation of the shift is ",round(sd,digits=2)," meters."))
## [1] "The standard deviation of the shift is 30.66 meters."
SE = sd/sqrt(n)
print(paste("The standard error of the mean is ",round(SE,digits=2),"."))
## [1] "The standard error of the mean is 5.51 ."
There are 30 (= \(n-1\)) degrees of freedom.
We can check with the t.test() function:
tt = t.test(data)
tt
##
## One Sample t-test
##
## data: data
## t = 7.1413, df = 30, p-value = 6.057e-08
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 28.08171 50.57635
## sample estimates:
## mean of x
## 39.32903
We can either use our results from the t-test before:
tt$conf.int
## [1] 28.08171 50.57635
## attr(,"conf.level")
## [1] 0.95
or
upper = m + 2*SE
lower = m - 2*SE
c(lower,upper)
## [1] 28.31452 50.34355
A more precise version:
upper = m + qt(0.975,df = n - 1)*SE
lower = m + qt(0.025,df = n - 1)*SE
c(lower,upper)
## [1] 28.08171 50.57635
\(H_0:\) The mean elevational shift is 0 meters. \(H_A:\) The mean elevational shift is not 0 meters.
We can again just use t.test():
tt$statistic
## t
## 7.141309
or do it by hand:
mu.0 = 0 # as specified in the null hypothesis
t.stat = (m - mu.0)/SE
t.stat
## [1] 7.141309
The data should be approximately normally distributed. We can check this with a histogram:
hist(data)
A slightly more precise way is to do a QQ plot:
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
qqPlot(data)
## [1] 3 19
As long as the points are within the dashed lines, you can safely do a t-test.
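If you want a formal check in addition to the visual ones, a Shapiro-Wilk test is a common choice (optional, and not part of the original solution; like all normality tests it has limited power in small samples):

```r
# H0: the data come from a normal distribution;
# a large p-value means no evidence against normality
shapiro.test(data)
```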
One way is to use the t.test() result again:
tt$p.value
## [1] 6.056689e-08
We can also do it by hand. For this it is good to make a sketch first:
# we first calculate the null distribution of our t statistic
x = seq(-8,8,by=0.01)
y = dt(x,df = n-1)
# and plot it
plot(x,y,type="l",xlab="t",ylab="density")
# we add the observed t statistic
abline(v = t.stat,col="slategray")
We see in the figure that the observed t statistic is far away from the mode of the t distribution. The p-value should therefore be very small. To calculate it, we compute the area under the curve to the right of the statistic and double this value to account for extreme events on the left end of the distribution (we perform a two-sided test). The function pt gives us the area to the left of the test statistic, so we use 2*(1-pt(t.stat,df = n-1)) to get the p-value:
2*(1-pt(t.stat,df = n-1))
## [1] 6.056689e-08
We choose a significance threshold of 5 percent, i.e., we want to be 95 percent certain that we do not have a false positive. The p-value is clearly smaller than that. Thus, the t-test shows that there is a statistically significant change in the highest elevation. Because the p-value is so small, we can be very certain of this result, as it would also hold at much lower significance thresholds.
A four-year review at the provincial hospital in Alotau, Papua New Guinea (Barss 1984), found that 1/40 of their hospital admissions were injuries due to falling coconuts. If coconuts weigh on average 3.2 kg, and the upper bound of the 95% CI is 3.5 kg, what is the lower bound of this CI? Assume a normal distribution of coconut weights.
Under a normal distribution, confidence intervals are perfectly symmetric around the mean. Thus the lower bound must be 3.2 - (3.5 - 3.2) = 2.9 kg.
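The symmetry argument amounts to one line of arithmetic, using the values from the exercise:

```r
m = 3.2      # mean coconut weight in kg
upper = 3.5  # upper bound of the 95% CI

# by symmetry: lower = mean - (upper - mean) = 2*mean - upper
lower = 2*m - upper
lower
## [1] 2.9
```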
A forensic pathologist wants to know whether there is a difference between the rate of cooling of freshly killed bodies and those which were reheated, to determine whether you can detect an attempt to mislead a coroner about time of death. He tested several mice for their “cooling constant” both when the mouse was originally killed and then after the mouse was re-heated. The data can be found in the file “mouse_cooling.csv” on ilias. Test whether there is any difference in the cooling constants between freshly killed and reheated corpses. Check whether you can use a t-test and justify your decision to use it. If a t-test is not justified, try to use an alternative strategy (e.g., non-parametric test).
library(tidyverse)
mouse_cooling <- read.csv("~/Dropbox/Teaching/StatisticsForBiology/mouse_cooling.csv")
mouse_cooling = mutate(mouse_cooling,diff = freshly.killed - reheated)
ggplot(mouse_cooling) +
geom_histogram(aes(x = diff),bins=7)
The data is clearly right skewed. This can also be seen with a QQ plot:
library(car)
qqPlot(mouse_cooling$diff)
## [1] 15 2
The null hypothesis is
\(H_0:\) The true mean of the differences is \(\mu_0 = 0\).
We try the following log transformation: \(x' = \log_{100}(100+x)\). Note that we add a constant of 100 so that we do not get negative values, and then use the base-100 logarithm. The new value of our null hypothesis is then \(\mu_0' = \log_{100}(100 + \mu_0) = \log_{100}(100) = 1\). (Careful in R: the base is the second argument of log, so this transformation is log(100 + x, base = 100).)
library(car)
qqPlot(log(100 + mouse_cooling$diff, base = 100))
## [1] 15 18
ggplot(mouse_cooling) +
geom_histogram(aes(x = log(100 + diff, base = 100)),bins=7)
This looks much better now and we perform a t-test on these data:
results = t.test(log(100 + mouse_cooling$diff, base = 100),mu = 1)
results
##
## One Sample t-test
##
## data: log(100 + mouse_cooling$diff, base = 100)
## t = -1.4466, df = 18, p-value = 0.1652
## alternative hypothesis: true mean is not equal to 1
## 95 percent confidence interval:
## 0.9254516 1.0137511
## sample estimates:
## mean of x
## 0.9696013
An alternative would be the sign test:
k = sum(mouse_cooling$diff<0)
results.sign = binom.test(k,n=19,p=0.5)
results.sign
##
## Exact binomial test
##
## data: k and 19
## number of successes = 7, number of trials = 19, p-value = 0.3593
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.1628859 0.6164221
## sample estimates:
## probability of success
## 0.3684211
In both cases we find no significant deviation from 0 in the differences. We thus cannot reject the null hypothesis.
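The exact p-value of the sign test can also be computed by hand: with 7 of 19 differences negative, the two-sided p-value is twice the lower-tail binomial probability:

```r
# P(X <= 7) for X ~ Binomial(19, 0.5), doubled for a two-sided test
2 * pbinom(7, size = 19, prob = 0.5)
## [1] 0.3592834
```

This matches the p-value reported by binom.test up to rounding.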
For each of the following scenarios, identify the best statistical test to use and state the null hypothesis. (Please note, do not give the answer to the specific question, but simply state the best test to use and the null hypothesis for the scenario.)
Asking whether stickleback fish occur with equal probability through all areas of a pond. We have taken random samples at various locations of the pond (all the same size) and have recorded the number of fish in each sampled location.
We want to know which of two tumor treatments showed better results. We recorded the change in tumor size during the treatment in two groups of patients, group 1 was treated with drug A and group 2 with drug B.
We want to know whether a new tumor treatment showed good results. We recorded the change in tumor size before and after the treatment in each patient.
Asking whether the number of smokers in a list of cities is proportional to the population size of those cities.
Testing whether patients change in body mass during a hospital stay.
\(\chi^2\) Goodness-of-fit test. H0: Fish numbers per unit area follow a Poisson distribution.
Two-sample t-test or a non-parametric alternative. H0: There is no difference in the mean tumor size change between the two groups.
One-sample t-test or a non-parametric alternative. H0: The mean tumor size change during the treatment is 0.
\(\chi^2\) Goodness-of-fit test. H0: The number of smokers in a city is proportional to the number of people in that city.
One-sample t-test. H0: The mean change in patient body mass during a hospital stay is zero.
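For the smokers scenario, the \(\chi^2\) goodness-of-fit test compares the observed counts to expected counts proportional to population size. A minimal sketch with made-up (hypothetical) smoker counts and city populations:

```r
# hypothetical smoker counts and population sizes for three cities
smokers = c(120, 90, 300)
pop = c(50000, 40000, 110000)

# H0: smokers are distributed proportionally to population size
chisq.test(x = smokers, p = pop/sum(pop))
```

For these made-up counts the test gives a p-value of about 0.2, so H0 would not be rejected.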
Health spending per person from a random sample of 20 countries is given in the file “health_expenditure.csv” on ilias. We will use this sample to estimate the mean of log health expenditure, including a confidence interval.
Visualize the frequency distribution using a histogram. Does the data deviate from a normal distribution? If yes, how?
Why is a log transformation a good choice for these data?
What is the sample size?
Calculate the natural logarithm of each data point.
What is the mean log health expenditure?
What is the standard deviation of the log health expenditure?
Calculate the standard error of the mean log health expenditure.
Calculate the 95% confidence interval of the mean log health expenditure.
What are the limits of this confidence interval expressed on original (non-log) scale?
health.expenditure <- read.csv("~/Dropbox/Teaching/StatisticsForBiology/health_expenditure.csv")
ggplot(health.expenditure) +
geom_histogram(aes(x = Per.capita.health.expenditure.in.2010),bins=10)
Because it is highly right skewed.
The sample size is
n = dim(health.expenditure)[1]
n
## [1] 18
data = mutate(health.expenditure,log.expend= log(Per.capita.health.expenditure.in.2010))
data$log.expend
## [1] 6.761573 5.768321 5.476464 6.782192 6.156979 4.276666 4.094345 6.408529
## [9] 3.850148 5.192957 5.509388 4.691348 7.436617 4.997212 5.888878 4.343805
## [17] 7.305860 6.522093
ggplot(data) +
geom_histogram(aes(x = log.expend),bins=4)
qqPlot(data$log.expend)
## [1] 13 9
The mean is
M =mean(data$log.expend)
M
## [1] 5.636854
The standard deviation is
SD = sd(data$log.expend)
SD
## [1] 1.110587
The standard error is
SE = SD/sqrt(n)
SE
## [1] 0.2617679
CI.log = t.test(data$log.expend)$conf.int
CI.log
## [1] 5.084572 6.189136
## attr(,"conf.level")
## [1] 0.95
or alternatively we can approximate it by:
CI.log.app = c(M-2*SE,M+2*SE)
CI.log.app
## [1] 5.113318 6.160390
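The exact multiplier behind the 95% CI is the critical value of the t distribution with \(n-1 = 17\) degrees of freedom (about 2.11), which is why the rule-of-thumb factor 2 gives a slightly narrower interval. Recomputing with the summary values from above:

```r
# summary values computed above
n = 18
M = 5.636854   # mean log expenditure
SE = 0.2617679 # standard error of the mean

t.crit = qt(0.975, df = n - 1)
c(M - t.crit*SE, M + t.crit*SE)
## [1] 5.084572 6.189136
```

This reproduces the interval returned by t.test.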
CI.nonlog = exp(CI.log)
CI.nonlog
## [1] 161.5108 487.4248
## attr(,"conf.level")
## [1] 0.95
M.nonlog = mean(data$Per.capita.health.expenditure.in.2010)
hist(data$Per.capita.health.expenditure.in.2010,col="firebrick",main="",breaks=10)
abline(v = CI.nonlog,lty=2)
abline(v = M.nonlog,lwd=2)
Note how asymmetric the back-transformed CI is - this reflects the asymmetry of the data. Be careful, however, as this is an approximation: the mean of the log-transformed data is not the same as the logarithm of the mean of the data, so the back-transformed interval refers to the geometric rather than the arithmetic mean.
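A quick numerical illustration of this caveat: back-transforming the mean of the logs yields the geometric mean, which for right-skewed data is smaller than the arithmetic mean.

```r
x = c(1, 10, 100)  # strongly right-skewed toy data

exp(mean(log(x)))  # geometric mean (back-transformed mean of logs)
## [1] 10
mean(x)            # arithmetic mean
## [1] 37
```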
Recycling paper has some obvious benefits, but it may have unintended consequences. For example, perhaps people are less careful about how much paper they use if they know their waste will be recycled. Catlin and Wang (2013) tested this idea by measuring paper use in two groups of experimental participants. Each person was placed in a room alone with scissors, paper and a trash can, and was told that he or she was testing the scissors. In the “recycling” group only, there was also a recycling bin in the room. The amount of paper used by each participant was measured in grams.
No recycling bin: 4,4,4,4,4,4,4,5,8,9,9,9,9,12,12,13,14,14,14,14,15,23 With recycling bin: 4,5,8,8,8,9,9,9,12,14,14,15,16,19,23,28,40,43,129,130
Make and examine a histogram of these data. Are the frequency distributions of the two groups similar in shape and spread?
Based on (a), discuss your options for testing a difference between the groups.
Apply a Mann-Whitney U-test (R function wilcox.test). State the null and alternative hypotheses.
Calculate the P-value as accurately as you can and interpret the result.
no.recycling = c(4,4,4,4,4,4,4,5,8,9,9,9,9,12,12,13,14,14,14,14,15,23)
recycling = c(4,5,8,8,8,9,9,9,12,14,14,15,16,19,23,28,40,43,129,130)
n.rec = length(recycling)
n.no.rec = length(no.recycling)
group = rep(c("Recycling","No.Recycling"),times=c(n.rec,n.no.rec))
dat = data.frame(var = c(recycling,no.recycling),group = group)
ggplot(dat) +
geom_histogram(aes(x = var,fill=group))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(dat) +
geom_histogram(aes(x = var,fill = group),bins=12) +
facet_wrap(~group,nrow=2)
Both shape and spread seem to differ here, although it is difficult to tell based on the small sample size. A normal distribution can, however, be ruled out, especially for the recycling group.
ggplot(dat) +
geom_qq(aes(sample = var,col=group)) +
geom_qq_line(aes(sample = var,col=group))+
facet_wrap(~group,nrow=2)
qqPlot(recycling)
## [1] 20 19
qqPlot(no.recycling)
## [1] 22 21
c and d)
\(H_0:\) It is equally likely that a randomly selected value from one group will be less than or greater than a randomly selected value from a second population. \(H_A:\) It is not equally likely that a randomly selected value from one group will be less than or greater than a randomly selected value from a second population.
wilcox.test(var~group,data=dat)
## Warning in wilcox.test.default(x = c(4, 4, 4, 4, 4, 4, 4, 5, 8, 9, 9, 9, :
## cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: var by group
## W = 126.5, p-value = 0.01825
## alternative hypothesis: true location shift is not equal to 0
# alternative without dataframes
wilcox.test(recycling,no.recycling)
## Warning in wilcox.test.default(recycling, no.recycling): cannot compute exact p-
## value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: recycling and no.recycling
## W = 313.5, p-value = 0.01825
## alternative hypothesis: true location shift is not equal to 0
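The W statistic reported by wilcox.test can be reproduced by hand: it counts the pairs in which a recycling value exceeds a no-recycling value, with tied pairs counting one half. (The two W values above also satisfy \(W_1 + W_2 = n_1 n_2 = 20 \cdot 22 = 440\).)

```r
# data from the exercise
no.recycling = c(4,4,4,4,4,4,4,5,8,9,9,9,9,12,12,13,14,14,14,14,15,23)
recycling = c(4,5,8,8,8,9,9,9,12,14,14,15,16,19,23,28,40,43,129,130)

# W = number of (recycling, no.recycling) pairs with the recycling value
# larger, counting tied pairs as one half
W = sum(outer(recycling, no.recycling, ">")) +
  0.5 * sum(outer(recycling, no.recycling, "=="))
W
## [1] 313.5
```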
For comparison, a t-test in this case would yield (note that the t-test uses a different null hypothesis):
t.test(var~group,data=dat)
##
## Welch Two Sample t-test
##
## data: var by group
## t = -2.1447, df = 19.681, p-value = 0.04465
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -34.9245559 -0.4663532
## sample estimates:
## mean in group No.Recycling mean in group Recycling
## 9.454545 27.150000
Both tests give similar results, which is a good sign. We therefore reject the null hypothesis: a randomly selected participant from the recycling group is likely to have used more paper than a randomly selected participant from the no-recycling group.
Use the data from exercise 26 and perform a permutation test on the data. Visualize the empirical sampling distribution and calculate a P-value. Compare it to the results from exercise 26. Hint: Use a for loop for the permutations. Alternatively, take a look at the package perm and the function permTS.
We first calculate the actual difference of the means:
no.recycling = c(4,4,4,4,4,4,4,5,8,9,9,9,9,12,12,13,14,14,14,14,15,23)
recycling = c(4,5,8,8,8,9,9,9,12,14,14,15,16,19,23,28,40,43,129,130)
n.rec = length(recycling)
n.no.rec = length(no.recycling)
group = rep(c("Recycling","No.Recycling"),times=c(n.rec,n.no.rec))
dat = data.frame(var = c(recycling,no.recycling),group = group)
MeanNoRec = mean(no.recycling)
MeanRec = mean(recycling)
diffMeans = MeanRec - MeanNoRec
nPerm = 1000
permResult <- vector() # initializes
for(i in 1:nPerm){
# step 1: permute the group labels
permSample <- sample(dat$group, replace = FALSE)
# step 2: calculate the difference between the group means
permMeanR <- mean(dat$var[permSample=="Recycling"])
permMeanNR <- mean(dat$var[permSample=="No.Recycling"])
# use the same direction as the observed statistic: recycling minus no recycling
permResult[i] <- permMeanR - permMeanNR
}
Plot the null distribution based on the permuted differences:
hist(permResult, right = FALSE, breaks = 100,col="darkred")
Note how odd the distribution looks. This is due to the small sample size: \(n_1 = 20, n_2 = 22\). More precisely, this is problematic because there are a few extreme values in the data set (129 and 130 in the recycling group). These extreme points have a strong impact on the empirical null distribution: if both land in the same resampled group, they shift the whole distribution strongly to the left or right. This causes the bimodal distribution and is a sign that the sample size might be too small for a permutation test. Usually, however, this leads to small power and large p-values, so we can nevertheless use the null distribution to calculate an approximate P-value; we just need to be aware that our results will be conservative. The p-value is twice the proportion of the permuted differences that are at least as large as the observed difference in means, diffMeans. The following code calculates the number of permuted differences at or above diffMeans.
sum(as.numeric(permResult >= diffMeans))
## [1] 1
These commands obtain the fraction of permuted differences at or above diffMeans.
sum(as.numeric(permResult >= diffMeans)) / nPerm
## [1] 0.001
Finally, multiply by 2 to get the P-value for a two-sided test.
2 * ( sum(as.numeric(permResult >= diffMeans)) / nPerm )
## [1] 0.002
We can illustrate the permuted differences and add the observed difference to show the outcome of our test: the bright red bars indicate the proportion of permutation results that were more extreme than the observed one.
hist(permResult[permResult<=diffMeans],breaks=seq(-20,20,by=1),col="darkred",main="")
hist(permResult[permResult>diffMeans],breaks=seq(-20,20,by=1),add=T,col="red")
abline(v = diffMeans,col="red",lwd=2)
An alternative approach is to use the perm package:
library(perm)
permTS(dat$var[dat$group=="Recycling"],dat$var[dat$group=="No.Recycling"],exact = T)
##
## Exact Permutation Test Estimated by Monte Carlo
##
## data: GROUP 1 and GROUP 2
## p-value = 0.004
## alternative hypothesis: true mean GROUP 1 - mean GROUP 2 is not equal to 0
## sample estimates:
## mean GROUP 1 - mean GROUP 2
## 17.69545
##
## p-value estimated from 999 Monte Carlo replications
## 99 percent confidence interval on p-value:
## 1.003509e-05 1.482735e-02
We again find that we can reject the null hypothesis.